I.  INTRODUCTION

The  use  of  distributed  data  bases  became  a practical  reality  with  the  implementation 
of  the  ARPANET.  Before  that  time  the  trend  in  computing  had  been  toward  centralized 
computing resources where the power of a large machine was deemed economical.
Concomitants of decentralization are decreased communication costs, increased efficiency due
to computer specialization, and increased system reliability due to system redundancy.

With  a centralized  computer  system  all  remote  users  must  communicate  with  the  computer 
for  all  computer  interactions,  while  with  a computer  network,  users  may  do  most  of  their 
communications  with  their  local  computer,  only  occasionally  communicating  with  remote 
computers  for  data  or  programs.  Though  computers  can  do  many  tasks  well,  some  are  more 
efficient at particular tasks than others. Some computers may perform scientific calculations
very efficiently while others may perform input/output operations very efficiently.
Currently,  no  computer  performs  all  operations  optimally.  Thus  a computer  network  can 
give  a user  access  to  a machine  which  can  handle  his  particular  problem  most  efficiently, 
while  in  a centralized  computer  system  the  mainframe  may  be  required  to  perform  many 
tasks  for  which  it  is  not  well  suited.  System  reliability  is  enhanced  with  a computer  network 
because  the  system  does  not  depend  on  just  the  operation  of  one  computer  complex.  If  one 
computer  system  should  go  down  most  users  still  have  access  to  the  other  computing 
resources  of  the  network,  while  in  a centralized  computing  system  if  the  computer  goes 
down  the  whole  system  is  unavailable  to  all  the  users.  For  these  reasons,  as  well  as  others, 
computer  networks  will  be  used  in  more  and  more  applications,  in  particular,  military 
applications. 

Military systems seem particularly suitable for implementation in a distributed
computer network. In military systems it is very important to communicate information
with superior, collateral and subordinate commands. It is also important that each command
have computing power available to it. Computing power local to the command to control
equipment, such as radars, guns and missiles, is needed at each command. Much of the
information collected by commands is shared with the other commands. Currently, a great
portion of this communication is done by teletype messages, a slow medium. The information
contained in the messages is not necessarily tailored to the needs of the unit receiving the
messages as it might be if it were the response to a query. Computer to computer
communication would allow less duplication of information, transmittal of more relevant
information and quicker transmission and usage of the information. For these reasons the
study and eventual efficient implementation of military distributed data bases are necessary.

The objective of our work is the design of efficient computer networks. To pursue
this objective we are currently investigating the problem of minimal cost allocation of files
in a computer network. In this report we review much of the previous work in this area to
establish the relationships which need to be considered to accurately model the file
allocation problem in a computer network. Most of the previous work has assumed complete
knowledge of the data base parameters. Solutions to the minimal cost allocation problem
were fixed in time, either in a one-time period or multi-time period problem. In our
research, of which this paper is the first step, we want to consider the dynamic adaptive
allocation of files.

The remainder of this report is divided into five sections which lead to the
cataloguing of relationships which will be used in modeling adaptive distributed data bases.
Section 2 classifies the distributed data base allocation work into four types of models:
deterministic one-phase, deterministic multi-phase, stochastic discrete, and stochastic
continuous. In Section 3 examples of the models of the above types are given, and the
relationships and assumptions used to define them are detailed. In Section 4, the work done by
previous authors is critiqued. In Section 5, a list of relationships is compiled which will be
used to model command control distributed data base systems. A summary of this work
is given in Section 6.

II.  DISTRIBUTED DATA BASE MODEL CLASSIFICATION

Models which have been used for describing distributed data bases can be classified
into  two  primary  groups.  The  two  primary  groups,  deterministic  and  stochastic,  can  be 
further  divided  into  two  subgroups  each.  The  four  classification  groups  for  distributed  data 
base  models  are: 

(i)  Deterministic - one-phase

(ii)  Deterministic  - multi-phase 

(iii)  Stochastic  - discrete  time 

(iv)  Stochastic  - continuous  time 

The  classification  is  based  upon  model  assumptions  and  does  not  necessarily  reflect  the  true 
environment  of  the  distributed  data  base  which  is  being  analyzed. 

The distinction between deterministic and stochastic models is as follows. In a
deterministic model all of the relevant information is assumed to be known ahead of time. Such
quantities as (1) the probability that a user will request a specific file, (2) the rate at which a
file is used, (3) the change of file usage patterns, etc., are assumed to be known prior to
system design. In addition, the model parameters do not change unless the new value of the
parameter and the time when it changes are known. On the other hand, stochastic
models allow for unknown parameters. These parameters are usually estimated and then the
distributed data base dynamically reconfigures to optimize system performance.

Deterministic systems are categorized into one-phase deterministic systems and
multi-phase periodic deterministic systems. In a one-phase deterministic model all
parameters are assumed to be known and constant. System performance is optimized for the fixed
parameters and no changes are made to the data base after this initial design. A
multi-phase periodic deterministic system allows changes to occur; however, these variations
must be known exactly. The data base system may also be assumed periodic. For example, a day
may be broken into an 8-hour day shift and a night shift. File usage rates would change
when the shifts change, and the times of the shift change would have to be known in addition
to knowing the rate changes. As can be seen from the general description of these
deterministic models, the models are not flexible and, in general, are not realistic models to
describe a distributed data base in the real world.
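
The shift example can be written down as a small sketch. It is only an illustration of what
"known exactly" means for a multi-phase periodic deterministic model: the function name, the
08:00 start of the day shift and the numerical rates are assumptions made for this example
and do not come from any of the papers reviewed.

```python
def usage_rate(hour, day_rate, night_rate, day_start=8, day_length=8):
    """Return the file usage rate in effect at the given hour.

    Both the rates and the switching times are fixed in advance, which is
    exactly the assumption of a multi-phase periodic deterministic model.
    day_start and day_length are hypothetical defaults.
    """
    h = hour % 24  # the usage pattern repeats every 24 hours
    in_day_shift = day_start <= h < day_start + day_length
    return day_rate if in_day_shift else night_rate


# Hypothetical rates: 5 requests/hour on the day shift, 1 request/hour otherwise.
for hour in (7, 8, 15, 16, 32):
    print(hour, usage_rate(hour, day_rate=5.0, night_rate=1.0))
```

Nothing in this sketch is estimated from observation; a stochastic model would replace the
fixed rates with quantities that have to be learned while the system runs.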

Stochastic models allow for some parameters to vary or to be unknown. In a discrete
time stochastic model the system is monitored at a sequence of times, and is then
reorganized based upon estimates and other available information to optimize the system
performance. An example of this situation is when the distributed data base under consideration is
linked together by asynchronous communication lines. In a continuous time stochastic
model, events are allowed to occur at any time. For example, a file usage rate may be
unknown and continuously varying with time. Note that stochastic models allow for a
changing environment which is not known a priori. Thus, it seems that stochastic models
provide a natural environment for describing distributed data bases.

III.  DISTRIBUTED DATA BASE MODELS - EXAMPLES

In this section a brief summary of the models used to describe the file allocation
problem in distributed data bases is given. The models will be described chronologically
within the breakdown given in the previous section.


Deterministic - One-phase

The original work in file allocation was by Chu. [1] He considered the following
zero-one programming model. Consider a network with $i = 1, \ldots, n$ computers and
$j = 1, \ldots, m$ files. Let $X_{ij}$ indicate whether the $j$th file is stored on the $i$th computer:

$$X_{ij} = \begin{cases} 1 & \text{if the } j\text{th file is stored in the } i\text{th computer} \\ 0 & \text{otherwise} \end{cases}$$


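Let $r_j$ be the number of redundant copies of file $j$. The following constraint is required:

$$\sum_{i=1}^{n} X_{ij} = r_j, \qquad j = 1, \ldots, m.$$

Let $L_j$ be the length of the $j$th file and $b_i$ be the available memory size of the $i$th computer.
Then the memory constraint implies

$$\sum_{j=1}^{m} X_{ij} L_j \le b_i, \qquad i = 1, \ldots, n.$$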
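
The copy-count and memory constraints can be checked mechanically for any proposed
zero-one allocation. The sketch below is a minimal illustration under the definitions given in
the text ($X_{ij}$, $r_j$, $L_j$, $b_i$); the function name and the small instance are
hypothetical and are not taken from Chu's paper.

```python
def satisfies_copy_and_memory(X, r, L, b):
    """Check the copy-count and memory constraints of a zero-one allocation.

    X[i][j] is 1 if file j is stored on computer i and 0 otherwise,
    r[j] is the required number of copies of file j,
    L[j] is the length of file j, and b[i] is the memory size of computer i.
    """
    n, m = len(X), len(L)
    copies_ok = all(sum(X[i][j] for i in range(n)) == r[j] for j in range(m))
    memory_ok = all(sum(X[i][j] * L[j] for j in range(m)) <= b[i] for i in range(n))
    return copies_ok and memory_ok


# Hypothetical instance: 2 computers, 3 files, the third file kept in two copies.
X = [[1, 0, 1],
     [0, 1, 1]]
r = [1, 1, 2]
L = [4, 6, 3]
b = [8, 9]
print(satisfies_copy_and_memory(X, r, L, b))
```

Shrinking the memory of the first computer below 7 units makes the same allocation
infeasible, which is the situation the memory constraint is meant to exclude.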

Let $a_{ijk}$ denote the expected time for the $i$th computer to receive the $j$th file from the $k$th
computer. Let $T_{ij}$ be the maximum expected allowable retrieval time of the $j$th file to the
$i$th computer. Then $a_{ijk}$ must not exceed $T_{ij}$:

$$(1 - X_{ij}) X_{kj} a_{ijk} \le T_{ij}, \qquad i \ne k.$$

Note that if $r_j = 1$ then $X_{ij} X_{kj} = 0$ for $i \ne k$ and the above equation reduces to

$$X_{kj} a_{ijk} \le T_{ij}.$$


In many of the models that follow it is assumed that $a_{ijk}$ is a constant, but in this case $a_{ijk}$
is estimated as a function of the file access rate and line transmission speed. The structure of
$a_{ijk}$ assumes that

$$a_{ijk} = w_{ik}^{(1)} + t_{kj} + w_{ki}^{(2)}$$

where $w_{ik}^{(1)}$ is the expected queuing delay at the $i$th computer for the channel to the $k$th
computer, $w_{ki}^{(2)}$ is the expected queuing delay at the $k$th computer for this channel to the
$i$th computer, and $t_{kj}$ is the expected computer access time to the $j$th file. The
superscripts indicate priority classes of messages. To simplify the analysis it is assumed that $t_{kj}$
is small compared to the queuing delays, and, further, that the delay of the short high
priority request messages is much shorter than the longer low priority reply messages.

Thus $a_{ijk}$ is approximated by $w_{ki}^{(2)}$ using a queuing model with Poisson arrivals of
requests at a rate $\lambda_{ik}$ between the $i$th and $k$th computer. To complete this queuing model the
following parameters are needed: $l_j$, the length of the file portion sent in response to the
query. Note $l_j \le L_j$. The average time to transmit the reply from the $k$th to the $i$th
computer is $1/\mu_{ik}$. The request rate for the $j$th file at the $i$th computer is $u_{ij}$.

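Then

$$\lambda_{ik} = \sum_{j=1}^{m} u_{ij} (1 - X_{ij}) X_{kj}.$$

Let

$$1/\mu_j = l_j / R$$

where $R$ is the transmission rate from the $k$th to the $i$th computer. Then $1/\mu_j$ is the
transmission time. The average time required to transmit a reply is then

$$\frac{1}{\mu_{ik}} = \frac{1}{\lambda_{ik}} \sum_{j=1}^{m} \frac{1}{\mu_j} u_{ij} (1 - X_{ij}) X_{kj}.$$

The traffic intensity is defined to be

$$\rho_{ik} = \frac{\lambda_{ik}}{\mu_{ik}} = \sum_{j=1}^{m} \frac{1}{\mu_j} u_{ij} (1 - X_{ij}) X_{kj}.$$

Then queuing theory for a Poisson arrival, constant service time model implies the waiting time

$$w_{ki}^{(2)} = \frac{\lambda_{ik}}{2 \mu_{ik} (\mu_{ik} - \lambda_{ik})}, \qquad i \ne k.$$

Taking the formulas above, combining them with the restriction $w_{ki}^{(2)} \le T_{ij}$ implies

$$(1 - X_{ij}) X_{kj} \lambda_{ik} - 2 \mu_{ik} (\mu_{ik} - \lambda_{ik}) T_{ij} \le 0.$$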
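
The delay calculation can be illustrated numerically. The sketch below evaluates the request
rate between two computers and the waiting time of a Poisson arrival, constant service time
queue, which is the standard result the text appeals to. The function names and the sample
values are assumptions for illustration; only the two formulas themselves follow the text.

```python
def request_rate(u, X, i, k):
    """Rate of requests from computer i that must be answered by computer k.

    u[i][j] is the request rate for file j at computer i and X[i][j] is the
    zero-one allocation; a request is remote when file j is absent at i
    and present at k.
    """
    return sum(u[i][j] * (1 - X[i][j]) * X[k][j] for j in range(len(u[i])))


def md1_waiting_time(lam, mu):
    """Expected queuing delay for Poisson arrivals (rate lam) and constant
    service time 1/mu: lam / (2 * mu * (mu - lam))."""
    if mu <= 0 or lam < 0 or lam >= mu:
        raise ValueError("the queue is stable only for 0 <= lam < mu")
    return lam / (2.0 * mu * (mu - lam))


def meets_retrieval_limit(lam, mu, t_max):
    """True when the approximated retrieval time does not exceed t_max."""
    return md1_waiting_time(lam, mu) <= t_max


# Hypothetical two-computer, two-file instance with one copy of each file.
X = [[1, 0],
     [0, 1]]
u = [[3.0, 2.0],
     [1.0, 4.0]]
lam_01 = request_rate(u, X, 0, 1)
print(lam_01, md1_waiting_time(1.0, 2.0), meets_retrieval_limit(1.0, 2.0, 0.5))
```

The waiting time grows without bound as the request rate approaches the service rate, which
is why the delay constraint couples the allocation variables to the traffic they generate.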

The above define all the constraints in Chu's model. All that needs to be defined now is the
objective function. The cost of the model to be minimized involves storage and
transmission costs.

$$C = C_{\text{storage}} + C_{\text{transmission}}$$
where

$$C_{\text{storage}} = \sum_{i,j} C_{ij} L_j X_{ij}$$

and

$$C_{\text{transmission}} = \sum_{i,j,k} \frac{1}{r_j} C'_{ik} l_j u_{ij} X_{kj} (1 - X_{ij}) + \sum_{i,j,k} C'_{ik} l_j u_{ij} X_{kj} P_{ij}.$$


The first term in the transmission cost represents costs due to file request and the second
term represents costs due to file update, where $P_{ij}$ is the frequency of modification of file $j$
on computer $i$. For multiple copies of files this model is not linear, but since the
programming problem is a zero-one problem there are standard methods for linearizing the model
which require additional constraints. The model is not very efficient. For a simplified
three computer, five file, one copy network the time to find the optimum solution was 25
seconds on an IBM 360/65.
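
For a network as small as the three computer, five file case, the single-copy problem can
also be solved by plain enumeration, which makes the structure of the cost easy to see. The
sketch below is not Chu's zero-one programming method; it is a brute-force search under
stated assumptions: one copy per file, a storage cost `Cs[i][j]` per unit length, a
transmission cost `Ct[i][k]` per unit length between computers i and k, and a modification
frequency `P[i][j]`. All of these names are chosen for this example.

```python
from itertools import product


def allocation_cost(site_of, Cs, Ct, L, l, u, P):
    """Storage plus transmission cost when file j is held only at site_of[j]."""
    n = len(Cs)
    total = 0.0
    for j, k in enumerate(site_of):
        total += Cs[k][j] * L[j]                          # storage cost
        for i in range(n):
            if i != k:
                total += Ct[i][k] * l[j] * u[i][j]        # remote queries
            total += Ct[i][k] * l[j] * u[i][j] * P[i][j]  # updates sent to the copy
    return total


def best_single_copy_allocation(Cs, Ct, L, l, u, P, b):
    """Enumerate every one-copy allocation that fits in memory; return (cost, site_of)."""
    n, m = len(Cs), len(L)
    best = None
    for site_of in product(range(n), repeat=m):
        used = [0] * n
        for j, k in enumerate(site_of):
            used[k] += L[j]
        if any(used[i] > b[i] for i in range(n)):
            continue  # violates the memory constraint
        cost = allocation_cost(site_of, Cs, Ct, L, l, u, P)
        if best is None or cost < best[0]:
            best = (cost, site_of)
    return best


# Hypothetical instance: each file is requested only at one computer.
Cs = [[0, 0], [0, 0]]
Ct = [[0, 1], [1, 0]]
L, l = [1, 1], [1, 1]
u = [[5, 0], [0, 5]]
P = [[0, 0], [0, 0]]
print(best_single_copy_allocation(Cs, Ct, L, l, u, P, b=[2, 2]))
```

The search visits $n^m$ allocations, so it is useful only as a check on very small
instances; the delay constraint is omitted here and would be applied as a further filter.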

Whitney considers the following problem which is of the one-phase deterministic
variety. Let $G(\Gamma, L)$ be the graph of the network on which the distributed data base
is located, where $\Gamma$ is the set of nodes which represents the locations in the network where
the terminal users are located and $L$ represents the edges of the graph which are
communication lines between user nodes. Communication costs are given as weights $g_{lm}$,
$l, m \in \Gamma$, on the edges of the graph. Let $S_{ij}$ represent the weight of the shortest path
from node $i$ to node $j$, i.e., $S_{ij}$ represents the cost of the least expensive communication
route from computer site $i$ to site $j$. Given the graph $G(\Gamma, L)$, an efficient algorithm
developed by Hu determines the path which yields the minimum communication cost between the
two nodes of interest. This minimum communication cost, $S_{ij}$, is given as

$$S_{ij} = \min_{P_{ij}} \sum_{(l,m) \in P_{ij}} g_{lm}, \qquad \text{where } P_{ij} \text{ is a path from node } i \text{ to node } j.$$

The next quantity considered in this model is the message traffic from user terminal
$i$ to file $k$, which is represented as $\gamma_i(k)$ and given by

$$\gamma_i(k) = P_i(k) R(k)$$

where $P_i(k)$ is the probability that file $k$ is requested from site $i$ and $R(k)$ is the rate at which
some record of file $k$ is requested. All records of a file are homogeneous, and, therefore,
$R(k)$ represents the rate of request for each record of file $k$. The quantity $\gamma_i(k)$ corresponds
to the rate of message traffic for records of file $k$ which is generated by user terminal $i$.

Now the cost to minimize for the assignment of file $k$ is

$$\min_{i \in \Gamma} \sum_{j \in \Gamma} \gamma_j(k) S_{ji}.$$

The cost of assigning file $k$ to site $i$ is given by

$$t(k,i) = \sum_{j \in \Gamma} \gamma_j(k) S_{ji}$$

and the cost of assigning file $k$ to site $j$ is a minimum when

$$t(k,j) \le t(k,i)$$

for every $i \in \Gamma$. Although not explicitly mentioned by Whitney, it appears that each of the
costs $t(k,i)$, $i \in \Gamma$, must be computed and then a comparison of the costs must be done in
order that the minimum cost be found. The model is also generalized slightly by allowing
multiple copies of a file to exist in the data base.
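
The compute-and-compare procedure described above can be sketched in a few lines. The
example uses the Floyd-Warshall all-pairs shortest path algorithm as a stand-in; it is not the
algorithm by Hu that the text cites, and the function names, the undirected-edge assumption
and the three-node instance are assumptions made only for illustration.

```python
def all_pairs_shortest_paths(n, edges):
    """Floyd-Warshall on an undirected graph; edges maps (a, b) to its weight."""
    inf = float("inf")
    S = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    for (a, b), w in edges.items():
        S[a][b] = min(S[a][b], w)
        S[b][a] = min(S[b][a], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if S[i][k] + S[k][j] < S[i][j]:
                    S[i][j] = S[i][k] + S[k][j]
    return S


def best_site(S, gamma_k):
    """Compute the cost of placing the file at every site and pick the cheapest.

    gamma_k[j] is the message traffic from terminal j to the file.
    Assumes the graph is connected so that every S[j][i] is finite.
    """
    n = len(S)
    costs = [sum(gamma_k[j] * S[j][i] for j in range(n)) for i in range(n)]
    site = min(range(n), key=costs.__getitem__)
    return site, costs


# Hypothetical line network 0 - 1 - 2 with a costly direct link between 0 and 2.
S = all_pairs_shortest_paths(3, {(0, 1): 1, (1, 2): 1, (0, 2): 5})
print(best_site(S, gamma_k=[1, 1, 1]))
```

Once the shortest path weights are tabulated, the assignment of a single file needs one
cost per candidate site followed by a comparison, as the text observes.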

The following paper by Casey presents a simpler model than the above models in
most ways except that it treats the number of copies of each file as a variable. Casey
assumes the files are independent of each other and thus may optimize one file at a time.
Three costs are considered: storage, query, and update. The query and update costs consist
of communication costs. Thus if $I$ is an index set representing the computers on which a
given file resides the cost $C(I)$ is given by

$$C(I) = \sum_{j=1}^{n} \Big[ \sum_{k \in I} \psi_j d'_{jk} + \lambda_j \min_{k \in I} d_{jk} \Big] + \sum_{k \in I} \sigma_k$$

where

$\sigma_k$ is the storage cost at the $k$th computer
$d_{jk}$ is the cost of communication from node $j$ to node $k$ for a query
$d'_{jk}$ is the cost of communication from node $j$ to node $k$ for an update
$\lambda_j$ is the volume of query traffic emanating from node $j$
$\psi_j$ is the volume of update traffic emanating from node $j$


Note  that  so  that  the  problem  is  nontrivial.  Using  this  model  Casey  proves  some 

theorems about the optimal number of copies of files in the network and properties of the
optimal file allocation.
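
The trade-off in this cost, where extra copies shorten queries but multiply updates, can be
seen by evaluating it for every possible set of sites in a small network. The sketch below is an
exhaustive search written for illustration only; it does not use Casey's theorems, and the
function names and the two-node instance are assumptions.

```python
from itertools import combinations


def copy_set_cost(I, sigma, d_query, d_update, lam, psi):
    """Storage, update and query cost when the file is held at the sites in I."""
    n = len(lam)
    storage = sum(sigma[k] for k in I)
    update = sum(psi[j] * d_update[j][k] for j in range(n) for k in I)  # every copy is updated
    query = sum(lam[j] * min(d_query[j][k] for k in I) for j in range(n))  # nearest copy answers
    return storage + update + query


def best_copy_set(sigma, d_query, d_update, lam, psi):
    """Try every non-empty set of sites; return (cost, sorted list of sites)."""
    n = len(sigma)
    candidates = (c for size in range(1, n + 1) for c in combinations(range(n), size))
    return min((copy_set_cost(I, sigma, d_query, d_update, lam, psi), list(I))
               for I in candidates)


# Hypothetical two-node network with expensive communication between the nodes.
sigma = [1, 1]
d = [[0, 10], [10, 0]]
print(best_copy_set(sigma, d, d, lam=[1, 1], psi=[0, 0]))  # query-only traffic
print(best_copy_set(sigma, d, d, lam=[1, 1], psi=[5, 5]))  # heavy update traffic
```

With no updates a copy at every node is cheapest, while heavy update traffic pushes the
optimum back to a single copy. The search examines $2^n - 1$ sets, so it is practical only
for small networks.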

A later paper by Morgan and Levin considers a model more general than Casey's
and different from Chu's. Their model considers both programs and data files. They
differentiate between programs and files because they assume that in a heterogeneous network
files would be transferable among all the computers but programs would be
transferable only among homogeneous subsets of computers. In their model, described below,
the indices $i, j, k$ refer to computers, $f$ to files and $p$ to programs. The network is assumed to
have $N$ computers, $F$ files and $P$ programs. Queries and updates are assumed to be processed
through programs. Let

$\lambda_{ipf}$ be query traffic from node $i$ to file $f$ via program $p$
$\psi_{ipf}$ be update traffic from node $i$ to file $f$ via program $p$
$C_{ij}$ be communication cost per query unit from $i$ to $j$
$C'_{ij}$ be communication cost per update unit from $i$ to $j$
$\sigma_{kf}$ be storage cost of file $f$ at $k$
$\sigma'_{jp}$ be storage cost of program $p$ at $j$
$\alpha$ be expansion factor for query message
$\beta$ be expansion factor for update message
$J_p$ be set of nodes where program $p$ may be stored

The expansion factors are the ratios of the length of query (update) from a program to the
length of query (update) from the originating node. To define the model the following
control variables are required:


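$$y_{kf} = \begin{cases} 1 & \text{if a copy of file } f \text{ is stored at node } k \\ 0 & \text{otherwise} \end{cases}$$

$$y'_{jp} = \begin{cases} 1 & \text{if a copy of program } p \text{ is stored at node } j \\ 0 & \text{otherwise} \end{cases}$$

$$x_{jkf} = \begin{cases} 1 & \text{if transactions from node } j \text{ to file } f \text{ are routed to node } k \\ 0 & \text{otherwise} \end{cases}$$

$$x'_{ijpf} = \begin{cases} 1 & \text{if transactions from node } i \text{ to file } f \text{ are routed to node } j \text{ via program } p \\ 0 & \text{otherwise} \end{cases}$$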


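The following two parameters define traffic flow:

$$\Lambda_{jf} = \sum_{i,p} \lambda_{ipf} x'_{ijpf} \qquad \text{is query traffic to file } f \text{ processed at node } j$$

$$\Psi_{jf} = \sum_{i,p} \psi_{ipf} x'_{ijpf} \qquad \text{is update traffic to file } f \text{ processed at node } j$$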


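The model may now be described. The objective is to minimize

$$\begin{aligned}
C = {} & \sum_{f,j,i,p} \lambda_{ipf} C_{ij} x'_{ijpf} && \text{(communication cost of queries from initiating nodes to the programs)} \\
& + \sum_{f,j,i,p} \psi_{ipf} C'_{ij} x'_{ijpf} && \text{(communication cost of updates from initiating nodes to the programs)} \\
& + \sum_{f,j,k} \alpha \Lambda_{jf} C_{jk} x_{jkf} && \text{(communication cost of queries from programs to files)} \\
& + \sum_{f,j,k} \beta \Psi_{jf} C'_{jk} y_{kf} && \text{(communication cost of updates from programs to files)} \\
& + \sum_{k,f} \sigma_{kf} y_{kf} && \text{(storage cost of files)} \\
& + \sum_{j,p} \sigma'_{jp} y'_{jp} && \text{(storage cost of programs)}
\end{aligned}$$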


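Subject to the following constraints:

• To assure the attainment of a feasible solution there must be at least one copy of each
file and each program.

$$\sum_{j} y'_{jp} \ge 1, \qquad p = 1, \ldots, P$$

$$\sum_{k} y_{kf} \ge 1, \qquad f = 1, \ldots, F$$

• To assure that every transaction to every file, via every program and from every node,
will have a defined route:

$$\sum_{j} x'_{ijpf} \ge 1, \qquad i = 1, \ldots, N; \; p = 1, \ldots, P; \; f = 1, \ldots, F$$

$$\sum_{k} x_{jkf} \ge 1, \qquad j = 1, \ldots, N; \; f = 1, \ldots, F$$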

• To assure residency of the appropriate files and programs in accordance with the defined
routes:

$$\sum_{i} x'_{ijpf} \le N y'_{jp}, \qquad j = 1, \ldots, N; \; p = 1, \ldots, P; \; f = 1, \ldots, F$$

$$\sum_{j} x_{jkf} \le N y_{kf}, \qquad k = 1, \ldots, N; \; f = 1, \ldots, F$$

• To assure that program $p$ will reside only in a node at which it can be processed:

$$y'_{jp} = 0, \qquad j \notin J_p, \; p = 1, \ldots, P$$

And $y'_{jp}$, $y_{kf}$, $x'_{ijpf}$, $x_{jkf}$ are binary variables.


Deterministic - Multi-phase

Levin and Morgan also consider deterministic multi-phase models. Their model
is a generalization of their one-phase model. Using the decomposition result mentioned
above they define a model to minimize the cost of allocating one file at a time. The model
above is used with the following changes:

$$y_{kt} = \begin{cases} 1 & \text{if a file copy is assigned to node } k \text{ at period } t \\ 0 & \text{otherwise} \end{cases}$$

$K_t = \{k : y_{kt} = 1\}$ is an arbitrary assignment of a file at period $t$

$K^T = \{K_t : t = 1, \ldots, T\}$ is an arbitrary arrangement of file assignments at periods 1 to $T$

The cost consists of two parts. First is the sum of the operating costs $C(K_t)$ over all
time periods given by the model in the previous section. The second is the transition costs
to change the file allocation from one period to the succeeding period. To determine this
cost a little additional notation is necessary. Let $L_s$ denote the length of the file in storage
units. Let $\gamma$ be the transformation factor from storage units to message units. Then the
number of message units, $L_m$, required to transmit the file is given by

$$L_m = L_s \gamma.$$

Then the transition cost from period $t-1$ to period $t$ is given by

$$I_t(K_{t-1}, K_t) = \sum_{k \in K_t} L_m \min_{l \in K_{t-1}} C_{lk}.$$

This cost is determined by sending the file over the most economical path. Combining the
above equations the total cost over $T$ time periods is given by

$$C_T(K^T) = \sum_{t=1}^{T} \big[ C(K_t) + I_t(K_{t-1}, K_t) \big].$$

The optimal solution $K^{T*}$ is given by

$$C_T(K^{T*}) = \min_{K^T} C_T(K^T).$$
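
Because the total cost is a sum of per-period operating costs and costs that link only
consecutive periods, the minimization over all arrangements can be organized as a dynamic
program over periods. The sketch below shows that idea; it is an illustration chosen for this
review rather than the solution procedure of Levin and Morgan, and the function names, the
list of candidate assignments and the numerical data are all assumptions.

```python
def transition_cost(K_prev, K, C, L_m):
    """Cost of supplying each node in K with a copy from its cheapest source in K_prev."""
    return L_m * sum(min(C[l][k] for l in K_prev) for k in K)


def plan_allocations(candidates, operating_cost, move_cost, initial, T):
    """Return (total cost, [K_1, ..., K_T]) minimising operating plus transition cost.

    candidates is a list of hashable assignments (for example frozensets of nodes),
    operating_cost(t, K) is the one-period cost and move_cost(K_prev, K) the
    transition cost; initial is the assignment before period 1.
    """
    best = {K: move_cost(initial, K) + operating_cost(1, K) for K in candidates}
    history = []
    for t in range(2, T + 1):
        new_best, came_from = {}, {}
        for K in candidates:
            prev = min(candidates, key=lambda Kp: best[Kp] + move_cost(Kp, K))
            new_best[K] = best[prev] + move_cost(prev, K) + operating_cost(t, K)
            came_from[K] = prev
        history.append(came_from)
        best = new_best
    last = min(candidates, key=lambda K: best[K])
    plan = [last]
    for came_from in reversed(history):  # walk the stored choices backwards
        plan.append(came_from[plan[-1]])
    plan.reverse()
    return best[last], plan


# Hypothetical two-node network; usage shifts from node 0 to node 1 in period 2.
C = [[0, 1], [1, 0]]
sites = [frozenset({0}), frozenset({1})]
op = {1: {sites[0]: 0, sites[1]: 10}, 2: {sites[0]: 10, sites[1]: 0}}
cost, plan = plan_allocations(
    sites,
    operating_cost=lambda t, K: op[t][K],
    move_cost=lambda Kp, K: transition_cost(Kp, K, C, L_m=3),
    initial=sites[0],
    T=2,
)
print(cost, [sorted(K) for K in plan])
```

With a short file the copy follows the usage pattern; making the file long enough that moving
it costs more than the operating penalty keeps it where it started.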

The final multi-phase deterministic model to be discussed is by Segall. Segall
extends the scope of this model from deterministic to adaptive models to be described in a
later section. To obtain the adaptive results, a model which is much more restrictive
than the model of Levin and Morgan is considered. The following general assumptions are
made in this model: there is only one copy of each file in the network, files are short so that
there are no storage or communication limitations, the files are requested according to a
mutually independent process, and files may only move from computer to computer in
response to a request. These assumptions imply the files may be considered independently
of each other. With these assumptions the model may now be defined for a network of two
computers. Let

$$Y_i(t) = \begin{cases} 1 & \text{if the file is stored at computer } i \text{ at time } t \\ 0 & \text{otherwise} \end{cases}$$

where $i = 1, 2$ and $t = 1, 2, 3, \ldots$ and let

$$\{n_i(t) : t = 1, 2, \ldots\}$$

be two independent binary sequences describing the file requests at each computer, where
$n_i(t) = 1$ if there is a request for the file from computer $i$ at time $t$ and $n_i(t) = 0$ otherwise.
Let $a_i(t)$ be the random rate of $n_i(t)$. Let $B(t)$ contain all past information relevant to the
evolution of $n(\cdot)$. Then

$$\Pr\{n_i(t) = 1 \mid B(t-1)\} = a_i(t).$$

Let $C_{ij}$ denote the cost of transmitting the file from computer $i$ to computer $j$. Let $C_i$
denote the cost of storing the file at computer $i$. Then the expected cost over all time
periods is

$$C = E \sum_{t=1}^{T} \sum_{i=1}^{2} \Big[ C_i Y_i(t) + \sum_{j \ne i} C_{ij} Y_i(t) n_j(t) \Big].$$


The control variables of the process are

$$u_i(t) = \begin{cases} 1 & \text{if the decision is to transfer the file to memory } i \text{ at time } t \\ 0 & \text{otherwise.} \end{cases}$$

The evolution of the process is defined by

$$Y_1(t+1) = Y_1(t) (1 - u_2(t)) + Y_2(t) u_1(t)$$
$$Y_2(t+1) = Y_1(t) u_2(t) + Y_2(t) (1 - u_1(t)).$$

Since the files may only move in response to a request the controls are functions of the
$n_i(\cdot)$. Thus

$$u_1(t) = \alpha_{21}(t) Y_2(t) n_1(t)$$
$$u_2(t) = \alpha_{12}(t) Y_1(t) n_2(t).$$

The optimization problem is then to find the control laws $\alpha_{ij}(\cdot)$ and initial locations to
minimize the expected cost of the network operation. For deterministic models Segall
also considers the following deterministic continuous time model, because in a network
with more than two computers the analysis is easier. Let

$$\{N_i(t), \; t \ge 0\}, \qquad i = 1, 2, \ldots, M$$

be $M$ independent counting processes representing the file requests at the individual
computers. Let $\lambda_i(t)$ be the random rate of the requests. The costs $C_i$ and $C_{ij}$ are the same as in
the previous paragraph. Thus over a period of time $T$ the total expected costs would be

$$C = E \int_0^T \sum_{i=1}^{M} \Big[ C_i Y_i(t) \, dt + \sum_{j \ne i} C_{ij} Y_i(t^-) \, dN_j(t) \Big].$$

The controls of the process are defined to be

$$\alpha_{ij}(t) = \begin{cases} 1 & \text{if the file is in computer } i \text{ at time } t^- \text{ and requested by computer } j \ne i \text{ at time } t \\ & \text{and the decision is to erase at } i \text{ and store at } j \\ 0 & \text{otherwise.} \end{cases}$$

The dynamics of the files are given by

$$dY_i(t) = -Y_i(t^-) \sum_{j \ne i} \alpha_{ij}(t) \, dN_j(t) + \sum_{j \ne i} \alpha_{ji}(t) \, Y_j(t^-) \, dN_i(t).$$

The problem is then to find the optimal controls $\alpha_{ij}(\cdot)$ which minimize the operation cost
of the model.

Stochastic - Discrete Time

There are two types of stochastic discrete models. The first is by Levin and
Morgan, which is an extension of their previously described deterministic models. The
second is by Segall, which, though it has more restrictive assumptions, is broader in
scope. This will be clarified below.

In the model of Levin and Morgan they assume the request rate $\lambda_{ipf}$ and update rate
$\psi_{ipf}$ are random variables. Rather than optimizing $C_T(K^T)$ Levin and Morgan optimize

$$E \, C_T(K^{T*}) = \min_{K^T} E \, C_T(K^T).$$

This is equivalent to optimizing the original model with the request and update rates replaced by
the expected values of the request and update rates. To obtain the optimal allocation, the
request and update rates are estimated statistically and substituted for the expected values
of the request and update rates. The optimal allocation is then determined as in the
deterministic cases.

The paper by Segall gives an adaptive file allocation algorithm as opposed to the
static allocation algorithm of Levin and Morgan. This is an extension of his two-computer
deterministic models given above. The notation used here will be that given above. It is
assumed that only one of the time varying rates is random. The system dynamics are

$$Y_1(t+1) = Y_1(t) [1 - u_2(t)] + Y_2(t) u_1(t)$$

$$Y_2(t+1) = Y_1(t) u_2(t) + Y_2(t) [1 - u_1(t)]$$

with the controls

$$u_1(t) = \alpha_{21}(t) Y_2(t) n_1(t)$$

$$u_2(t) = \alpha_{12}(t) Y_1(t) n_2(t)$$

where, here it is assumed, $n_2(t)$ is Bernoulli with known rate $a_2(t)$ and $n_1(t)$ has a random
rate $a_1(t)$. The random rate $a_1(t)$ will be modeled as a finite-state Markov process with
states $\rho^{(1)}, \ldots, \rho^{(m)}$, transition probabilities

$$\Pr\{a_1(t+1) = \rho^{(j)} \mid a_1(t) = \rho^{(i)}\} = q_{ij}(t),$$

and initial distribution

$$\Pr\{a_1(1) = \rho^{(k)}\} = \pi_k, \qquad k = 1, 2, \ldots, m.$$

Define the variable

$$x_k(t) = \begin{cases} 1 & \text{if } a_1(t) = \rho^{(k)} \\ 0 & \text{otherwise.} \end{cases}$$

Let $x(t)$ be the vector with the components $x_i(t)$, $i = 1, \ldots, m$. Let $\hat{x}(t \mid t-1)$ be the least squares
estimate of $x(t)$ given $\{n_1(1), \ldots, n_1(t-1)\}$. Then the problem is to find the optimal controls
to minimize the cost function

$$C = E \sum_{t=1}^{T} \Big[ \big(C_1 + C_{12} a_2(t)\big) Y_1(t) + \big(C_2 + C_{21} \hat{a}_1(t \mid t-1)\big) \big(1 - Y_1(t)\big) \Big]$$

where $\hat{a}_1(t \mid t-1)$ is the estimate of $a_1(t)$ given information up to time $t-1$.
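
The estimate of the rate can be illustrated with a short recursion. For an indicator vector,
the least squares estimate given the past requests is the vector of conditional state
probabilities, so one prediction step consists of a Bayes update on the latest request
followed by a move through the transition probabilities. The sketch below is a generic
hidden Markov predictor written under that observation; it is not claimed to be the recursion
in Segall's paper, and the function names and the two-state example are assumptions.

```python
def predict_next(p, n_obs, rho, Q):
    """One-step predictor for the state probabilities of the request rate.

    p[k] is the probability that the rate equals rho[k] at time t given requests
    up to t - 1, n_obs is the request indicator observed at time t (0 or 1) and
    Q[i][j] is the probability of moving from rho[i] to rho[j].
    Assumes the observation has non-zero probability under p.
    """
    likelihood = [r if n_obs == 1 else 1.0 - r for r in rho]
    joint = [pk * lk for pk, lk in zip(p, likelihood)]
    total = sum(joint)
    posterior = [v / total for v in joint]  # Bayes update on the new request
    m = len(rho)
    return [sum(posterior[i] * Q[i][j] for i in range(m)) for j in range(m)]


def rate_estimate(p, rho):
    """Estimated request rate: the states weighted by their probabilities."""
    return sum(r * pk for r, pk in zip(rho, p))


# Hypothetical two-state rate that never changes state (identity transitions).
rho = [0.1, 0.9]
Q = [[1.0, 0.0], [0.0, 1.0]]
p = [0.5, 0.5]
print(rate_estimate(p, rho))
p = predict_next(p, n_obs=1, rho=rho, Q=Q)
print(p, rate_estimate(p, rho))
```

A single observed request shifts most of the probability to the high-rate state, and the
updated estimate is the quantity that enters the cost function in place of the unknown rate.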


Stochastic - Continuous Time

As in the discrete case Segall models the continuous time request rates as a
finite-state Markov process. The notation will be the same as the above notation for the
continuous time deterministic multi-phase model. The dynamics of the file are again given by

$$dY_i(t) = -Y_i(t^-) \sum_{j \ne i} \alpha_{ij}(t) \, dN_j(t) + \sum_{j \ne i} \alpha_{ji}(t) \, Y_j(t^-) \, dN_i(t).$$

The problem is to determine the dynamic controls

$$\alpha_{ij}(t) = \alpha_{ij}\big(t, N_l(s), Y_l(s), \; s < t, \; l = 1, \ldots, M\big)$$

that minimize the cost

$$C = E \int_0^T \sum_{i=1}^{M} \Big[ C_i Y_i(t) \, dt + \sum_{j \ne i} C_{ij} Y_i(t^-) \, dN_j(t) \Big].$$

Defining

$$J(t) = i \quad \text{if } Y_i(t) = 1,$$

there is a one to one correspondence between $J(\cdot)$ and $Y(\cdot)$, thus the cost may be written
more compactly as

$$C = E \int_0^T L\big(t, J(t), \lambda(t)\big) \, dt.$$

To avoid notational difficulties Segall further assumes that only $\lambda_1(t)$ is random and that
$\lambda_i(\cdot)$, $i = 2, \ldots, M$, are deterministic. The random rate $\lambda_1(\cdot)$ is modeled as a finite-state
Markov process with states

$$\rho^{(1)} < \rho^{(2)} < \cdots < \rho^{(m)}$$
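and transition probabilities

$$\Pr\{\lambda_1(t+dt) = \rho^{(j)} \mid \lambda_1(t) = \rho^{(k)}\} = q_{kj}(t) \, dt + o(dt), \qquad k \ne j$$

$$\Pr\{\lambda_1(t+dt) = \rho^{(k)} \mid \lambda_1(t) = \rho^{(k)}\} = 1 + q_{kk}(t) \, dt + o(dt)$$

where

$$q_{kk}(t) = -\sum_{i \ne k} q_{ki}(t)$$

and initial distribution

$$\Pr\{\lambda_1(0) = \rho^{(k)}\} = \pi_k.$$

Define

$$x_k(t) = \begin{cases} 1 & \text{if } \lambda_1(t) = \rho^{(k)} \\ 0 & \text{otherwise.} \end{cases}$$

Then if $\hat{x}(t)$ is the least squares estimate of

$$x(t) = [x_1(t), \ldots, x_m(t)]^T$$

and

$$\rho = [\rho^{(1)}, \ldots, \rho^{(m)}]^T$$

then the best estimate $\hat{\lambda}_1(t)$ of $\lambda_1(t)$ given $\{N_1(s), \; s < t\}$
is

$$\hat{\lambda}_1(t) = \rho^T \hat{x}(t).$$

Then the cost may be written

$$C = E \int_0^T L\big(t, J(t), \hat{x}(t)\big) \, dt$$

where $\hat{x}(t)$ satisfies the $m$-dimensional recursive equation

$$d\hat{x}(t) = \big[ Q^T(t) \hat{x}(t) - P(\hat{x}(t)) \rho \big] \, dt + \frac{P(\hat{x}(t^-)) \rho}{\rho^T \hat{x}(t^-)} \, dN_1(t)$$

where

$$Q(t) = \{q_{kj}(t)\}, \qquad k, j = 1, 2, \ldots, m$$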

and

$$P(\hat{x}(t)) = E\big\{ [x(t) - \hat{x}(t)] [x(t) - \hat{x}(t)]^T \mid N_1(s), \; s < t \big\}.$$

To solve this model, results from the theory of stochastic processes are required.

IV.  CRITIQUE OF MODELS

The first research in the allocation of files in a distributed data base is due to
Chu. [1] Chu treats the deterministic one-phase model. His work has the most general
assumptions of all the papers considered herein. As the papers became more current the
models became less general. For example, Chu is the only author to consider a finite
memory limitation. However, the later papers consider more general types of problems,
for example stochastic or multi-phase models. In this section a review of each of the papers
by category will be given.


Deterministic  One-phase 

Most of the authors consider models of this type. As mentioned above, Chu's model
is the most general. He considers limited memory, file length, communication delay as a
function of the file request rates, update rates, storage costs, communication costs, and
multiple copies. These factors are used to build a zero-one programming problem which
can be transformed, using standard techniques, into a linear zero-one programming problem.
This model laid the framework for the succeeding models. The principal drawback of
Chu's model is that it requires that all the parameters, e.g., number of copies of files, request
rates, etc., be known and constant throughout the use of the system. Another drawback is
that the algorithm for optimizing the performance of this model is computationally
very slow.
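
The standard techniques referred to above are not spelled out in this survey. One common
way to linearize a product of zero-one variables, offered here as an assumption and not as
Chu's own construction, is to replace the product x*y by a new zero-one variable z subject
to z <= x, z <= y and z >= x + y - 1. The sketch below (function names are hypothetical)
checks by enumeration that these linear constraints admit only z = x*y.

```python
from itertools import product


def linearized_values(x, y):
    """Return every zero-one z allowed by the linear constraints for given x, y."""
    return [z for z in (0, 1) if z <= x and z <= y and z >= x + y - 1]


def check_linearization():
    """True if, for all zero-one x and y, the only feasible z equals x*y."""
    return all(linearized_values(x, y) == [x * y]
               for x, y in product((0, 1), repeat=2))
```

Each product term in a nonlinear zero-one objective costs one extra variable and three
extra constraints under this scheme, which suggests why the resulting linear program
can grow large.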

Whitney's model of file allocation is only part of an overall system design model for
a network. The solution technique used to allocate the files is not very elegant; the cost of
assigning a file to each site must be computed and then the site associated with the minimum
cost is allocated the file. In order to compute the cost of assigning a file to a particular site,
all of the routes between the proposed site and sites which request the file must be
enumerated. In any system with a large number of nodes the allocation of just a single file
would be a formidable task to perform because of the problem of enumerating all the
possible routes between two nodes. In addition to the enumeration problem, all parameters
associated with a file (request rates from each user, length, etc.) must be known prior to
system design.
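
The minimum-cost selection step described above can be sketched as follows. This is an
illustration under simplifying assumptions, not Whitney's procedure: the per-request cost
between every pair of sites is taken as already given in a matrix (the hypothetical
`unit_cost`), which sidesteps the route enumeration that makes the original method
expensive. The names `cheapest_site` and `request_rates` are likewise assumptions.

```python
def cheapest_site(request_rates, unit_cost):
    """Pick the site that minimizes total access cost for one file.

    request_rates[j]: rate at which site j requests the file (assumed known).
    unit_cost[i][j]:  cost per request when the file is held at site i and
                      requested by site j (assumed given, not derived from routes).
    Returns (best site index, list of total cost per candidate site).
    """
    sites = range(len(unit_cost))
    totals = [sum(request_rates[j] * unit_cost[i][j] for j in sites)
              for i in sites]
    best = min(sites, key=lambda i: totals[i])
    return best, totals


# Three sites; site 0 generates most of the requests.
best, totals = cheapest_site([5, 1, 2],
                             [[0, 4, 3],
                              [4, 0, 2],
                              [3, 2, 0]])
```

Here the totals are 10, 24 and 17, so the file is assigned to site 0.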

Casey, noting that Chu assumes the number of copies of files to be known, attacks
the problem of determining the optimal number of copies of files in a distributed data
base. Though his model is not as general as Chu's, it considers the costs of locating files at
nodes, communication costs of queries and updates, and volume of communication and
update traffic. This model allows each file to be treated independently. Using these
assumptions Casey is able to prove theorems which determine the optimal number of copies of files
and procedures for efficiently determining the locations of the files. The principal
drawbacks in this work are the limited assumptions, independence of files and unlimited
memory, which are too simplified to be realistic.
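
The trade-off that determines the number of copies can be illustrated with a small
exhaustive search. The cost structure below is an assumption made for illustration
(queries go to the cheapest copy, updates go to every copy, and each copy incurs a storage
cost); it is not Casey's formulation, and the brute-force enumeration is exactly what his
theorems are meant to avoid. All names are hypothetical.

```python
from itertools import combinations


def best_copy_set(storage, query_rate, update_rate, unit_cost):
    """Exhaustively search copy placements for one file.

    Cost of a placement = storage cost at every node holding a copy
                        + each node's queries sent to its cheapest copy
                        + each node's updates sent to every copy.
    unit_cost[j][i] is the assumed cost per message from node j to node i.
    Returns (tuple of nodes holding copies, minimum cost).
    """
    nodes = range(len(storage))
    best_cost, best_set = None, None
    for k in range(1, len(storage) + 1):
        for copies in combinations(nodes, k):
            cost = sum(storage[i] for i in copies)
            cost += sum(query_rate[j] * min(unit_cost[j][i] for i in copies)
                        for j in nodes)
            cost += sum(update_rate[j] * unit_cost[j][i]
                        for j in nodes for i in copies)
            if best_cost is None or cost < best_cost:
                best_cost, best_set = cost, copies
    return best_set, best_cost
```

With heavy query traffic at two of three nodes and light update traffic, two copies are
cheaper than either one copy or three, which shows how additional copies lower query cost
while raising update and storage cost.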

The major contribution of Levin and Morgan is in the extension of the deterministic
one-phase model into multi-phase and stochastic models, to be discussed below. In the
one-phase deterministic model they differentiate between programs and data files. This is done
for two reasons. First, programs are not necessarily compatible with all computers in a
network, while data files can be made compatible. Second, programs can initiate requests for
other files while the reverse does not occur. This model also has no memory limitations.
The model is more general than Casey's but not more general than Chu's. The
communication costs are given and are not a function of the message traffic. These assumptions
allowed Levin to prove that the file allocation problem can be decomposed to individual
file minimization problems. The optimal file allocation can be obtained by optimizing
the location of one file at a time. Such results are interesting and can extend the intuition
with regard to file allocation. But assumptions which neglect the interrelationships of files
are lacking in realism.
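
The decomposition result can be illustrated numerically. The sketch below assumes a cost
table `cost[f][s]` (hypothetical) giving the cost of placing file f at site s with no term
that couples one file to another; under that assumption, minimizing file by file gives the
same placement as a brute-force search over all joint placements. It illustrates the
stated property and is not Levin's proof.

```python
from itertools import product


def allocate_independently(cost):
    """Choose, for each file separately, the site with the lowest cost."""
    return [min(range(len(row)), key=row.__getitem__) for row in cost]


def allocate_jointly(cost):
    """Brute-force search over all joint placements, for comparison."""
    n_sites = len(cost[0])
    placements = product(range(n_sites), repeat=len(cost))
    return list(min(placements,
                    key=lambda p: sum(cost[f][s] for f, s in enumerate(p))))


cost = [[4, 2, 7],   # file 0
        [1, 3, 5],   # file 1
        [6, 6, 2]]   # file 2
```

A memory limit at each site would introduce a constraint shared by all files, and the
file-by-file search would then no longer be guaranteed to agree with the joint search.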

The  models  discussed  above  are  all  deterministic  one-phase  models.  This  type  of 
model  assumes  that  all  the  characteristics  of  the  network  are  known  at  design  time  and  that 
they  remain  the  same  thereafter.  This  assumption  is  a severe  restriction.  The  model  described 
in  the  next  section  is  one  generalization. 

Deterministic  Multi-phase 

Levin and Morgan generalize their one-phase model into a multi-phase model.
The model is deterministic in that the assumption is made that for each time period all the
characteristics of the model are known. The same basic model is considered as in the
one-phase model with the addition of a transition cost between time periods. This takes care of
the cost of transferring files from their allocations at one time period to the next. The
authors do not give an algorithm for solving this problem. They do refer to another paper
in which a dynamic programming solution is discussed. The major drawback in this model
is that the system designer must know the file usage in advance, as in the one-phase model.
Realistic systems of the future will necessarily have to be adaptive, and deterministic models
will only be useful for either preliminary analyses or finding approximate solutions.
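
The dynamic programming solution is not reproduced in this survey. The following is a
minimal sketch of how such a recursion could look for one file with a single copy,
assuming the per-period operating cost at each site and the site-to-site transition cost
are known in advance. The function and parameter names are hypothetical, and the sketch is
not taken from the paper the authors cite.

```python
def multiphase_allocation(period_cost, move_cost):
    """Dynamic program for one file, one copy, over several time periods.

    period_cost[t][s]: operating cost of holding the file at site s in period t.
    move_cost[a][b]:   transition cost of moving the file from site a to site b.
    Returns (minimum total cost, list of chosen sites, one per period).
    """
    n_sites = len(move_cost)
    best = list(period_cost[0])   # best[s]: cheapest cost of a plan ending at s
    back = []                     # back[t-1][s]: site used in period t-1 on that plan
    for costs in period_cost[1:]:
        prev = [min(range(n_sites), key=lambda a: best[a] + move_cost[a][s])
                for s in range(n_sites)]
        best = [best[prev[s]] + move_cost[prev[s]][s] + costs[s]
                for s in range(n_sites)]
        back.append(prev)
    # Trace the cheapest plan backwards from the final period.
    site = min(range(n_sites), key=lambda s: best[s])
    total, path = best[site], [site]
    for prev in reversed(back):
        site = prev[site]
        path.append(site)
    return total, path[::-1]
```

For two sites with period costs [[1, 5], [6, 1], [1, 6]], a transition cost of 2 makes it
worthwhile to move the file to site 1 for the middle period and back again, while a
transition cost of 3 makes it cheaper to leave the file at site 0 throughout.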

Stochastic  Discrete 

The stochastic model of Levin and Morgan is based on their multi-phase deterministic
model. The equations are the same. The only difference is that they assume the request
and update rates are random variables rather than known quantities. Thus in order to
optimize the objective function they must consider the expected value of the deterministic
objective function. Levin [7] demonstrates that the optimal solution to the file assignment
reduces to that of estimating the first moment of the access rate distributions. The
drawback to this work is that the optimal solution is not adaptive but requires estimation of the
parameters at system design time.
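
The reduction to first moments follows whenever the cost is linear in the rates, since the
expected value of a linear function equals the function of the expected values. The sketch
below checks this on a small discrete distribution, using exact fractions. The linear cost
form and all names are assumptions made for illustration and are not Levin's equations.

```python
from fractions import Fraction


def access_cost(rates, unit_cost_to_site):
    """Cost that is linear in the request rates (the assumption that matters)."""
    return sum(r * c for r, c in zip(rates, unit_cost_to_site))


def expected_cost(scenarios, unit_cost_to_site):
    """scenarios: list of (probability, rates) pairs describing random rates."""
    return sum(p * access_cost(rates, unit_cost_to_site)
               for p, rates in scenarios)


def mean_rates(scenarios):
    """First moment of the rate distribution, node by node."""
    n = len(scenarios[0][1])
    return [sum(p * rates[j] for p, rates in scenarios) for j in range(n)]


scenarios = [(Fraction(1, 4), [8, 0]),
             (Fraction(3, 4), [4, 4])]
unit_cost_to_site = [3, 5]
```

In this example both `expected_cost(scenarios, unit_cost_to_site)` and
`access_cost(mean_rates(scenarios), unit_cost_to_site)` evaluate to 30, so only the mean
rates are needed to rank candidate sites.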

The paper by Segall, though it has much more restrictive assumptions (considers one
file at a time, fixed communication costs and only one copy of a file in the network), gives a
model whose solution to the file assignment problem is adaptive. Segall develops a
finite-state Markov process and uses dynamic programming to obtain the optimal solution. This
paper is the only paper to date which obtains an adaptive solution to the file assignment
problem.

Stochastic  Continuous 

In the discrete case Segall solved the problem for a two-computer network. For a
computer network with more than two computers there is a nonzero probability that more
than one computer will request the file at a given time. In a continuous model the
probability of that event is zero, which simplifies the analysis. Segall again derives an adaptive
solution for the stochastic continuous model. The principal drawback to this work is that
the models are too simple to be of practical use. But the work is important because it gives
an adaptive solution to the file optimization problem in a distributed data base.
V.  RELATIONSHIPS USED IN MODELING DISTRIBUTED DATA BASES


This section contains lists of assumptions, parameters, costs and relationships that
different authors have considered when describing models of distributed data bases and
computer networks. These lists provide a summary of pertinent ideas when developing a
model which describes a distributed data base in a computer network. The four primary
categories considered when modeling a distributed data base are: (1) file information
and parameters, (2) transmission characteristics, (3) computer characteristics, and
(4) costs. Some overlap exists between the lists.

File  Information  and  Parameters 

The  salient  features  which  describe  files  for  the  purpose  of  setting  up  a model  of  a 
distributed  data  base  are  the  number  of  copies  of  each  file,  the  length  of  each  file,  and  the 
rate or frequency at which each of the files is accessed. These features are listed in Table I.

The  number  of  copies  of  individual  files  is  usually  assumed  to  be  known  at  the  time 
of  system  design.  In  addition,  most  models  assume  that  only  one  copy  of  a particular  file 
exists  in  the  distributed  data  base  since  the  analysis  and  modeling  are  simpler  than  in  the 
case  of  multiple  copies.  Another  option  to  choose  from  when  setting  up  the  model  is  to 
let  the  number  of  copies  of  an  individual  file  be  a variable  used  in  the  optimization  procedure. 

For the length of the file one of two choices is usually made: the length of a file
is assumed to be known, or the files are assumed to be short. When the length of
a file is known, memory restrictions are placed on each of the nodes in the distributed data
base and the available memory at each node is a restriction placed on the model for
optimization. The use of short files arises from several assumptions. One can argue that
memory is inexpensive and, therefore, the amount of storage needed at each node is an
irrelevant factor to consider when optimizing the system performance. The fact that storage
capacity at computer sites may be fully utilized places this argument upon untenable grounds.
Another argument is more direct and to the point: the files are assumed to be short
because this implies that the files are independent of each other and, hence, no interaction
between the files takes place, not to mention the fact that the analysis of the optimization
problem is easier to accomplish.

Query and update rates are other file parameters which are considered when
modeling a distributed data base. For some models query (request for information only) rates and
update (change the contents of the file only) rates are combined into a single request rate.
Most models assume that the rates are known prior to system design. When the rates are not
known, an estimate must be formed in real time and the system, possibly, has to
reconfigure itself based upon the estimated information.
Transmission  Characteristics


The objective of considering transmission characteristics is to determine the cost of
transferring information between nodes of the distributed data base and ensure a rapid response
to queries and updates. The transmission characteristics are summarized in Table II. One of
the concerns with transmission characteristics between nodes is the time to retrieve and
transfer a file from one node of a network to another node. In addition, constraints or priorities
may be placed upon the transmission channel, or the messages (file queries and updates) and
file transfers may be modeled as a queuing system.

The model structure for determining the cost of transferring information between
nodes can be simple or elaborate. The simplest model for transmission cost is to lump all
the cost into one quantity which represents the cost to transmit a particular file from one
node to another node. Constraints, such as maximum retrieval and transfer time and
transmission channel capacity, are placed on some of the models. At the other extreme of the message
transfer model are elaborate message queuing structures which take into account the average
query and update rates of a particular file, random lengths of messages, priorities on
different types of messages and information transfers, and average message traffic between nodes.
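
As an illustration of how a queuing structure ties message delay to traffic, the simplest
textbook case is the single-server queue with Poisson arrivals and exponentially
distributed service times (M/M/1), for which the mean time a message spends in the system
is 1/(mu - lambda). Poisson arrivals appear among the characteristics listed in Table II;
the exponential service time and the single channel are assumptions made here, and none of
the surveyed models is claimed to use exactly this form.

```python
def mm1_mean_delay(arrival_rate, service_rate):
    """Mean time a message spends in an M/M/1 queue (waiting plus transmission).

    arrival_rate: average messages per unit time offered to the channel (lambda).
    service_rate: average messages per unit time the channel can send (mu).
    """
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)
```

With a channel able to send 5 messages per second, raising the offered traffic from 3 to
4 messages per second doubles the mean delay from 0.5 to 1.0 second, which shows why delay
cannot be treated as a fixed cost once request rates approach channel capacity.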

Computer  Characteristics 

Computer characteristics, listed in Table III, which are relevant to modeling
distributed data bases for the purpose of optimal file allocation are the amount of memory, file
access and retrieval time, and the compatibility of programs and files on different machines.
The primary computer characteristic considered is the available memory at a site for
accepting the transfer of a file. As mentioned above, this option of limited memory may, or may
not, be invoked.

Another computer characteristic which is considered in some models is the access
time required to retrieve a file from a particular storage medium, such as disk access time as
opposed to tape drive access time. However, after breaking the cost (or time) of obtaining
access to a file into fine detail, e.g., query time plus disk access time plus reply time, an
assumption is generally made to simplify the analysis. An example of a simplifying assumption which
one model uses is that the machine access time is much shorter than either the query or
response message time and, therefore, the disk access time can be ignored altogether.

Another alternative is to allow data to be transferred to any node of the network while
programs can be transferred only to a limited subset of the network nodes. Two points
should be noted here. First, the term file can be used in a generalized sense to include a
program, as well as data files. Hence, all of the models considered could be used for both
program and file allocation in a network. Second, restricting the nodes of a network to which
a file (program or data file) can be transferred is usually a simple constraint which could be
placed on most of the models without increasing the complexity of analysis. In models which
assume that the files are independent, restricting the allowable nodes to which a file can be
transferred imposes no additional constraints; however, for other models this restriction
may be nontrivial.

Costs 

The final category of relationships to be considered when developing a model for file
allocation in a distributed data base is a collection of miscellaneous costs (Table IV). These
costs include storage costs, query and update costs, reconfiguration costs, and
communication costs.

The  only  costs  which  all  models  consider  are  the  cost  of  storing  a file  at  a particular 
node  location  and  the  transmission  cost  of  sending  the  information  of  one  file  from  one 
node  to  another  node.  In  the  case  of  storage  costs  of  a file  at  a particular  site,  a cost  is 
charged  for  the  storage  of  a file  regardless  of  the  memory  restriction.  In  particular,  if  the 
model  assumes  no  restriction  on  memory,  a cost  is  given  to  the  amount  of  storage  that  a 
file  requires.  As  with  message  transmittal  the  transmission  costs  may  be  simple  or  complex 
and,  in  fact,  the  costs  for  transmission  are  based  directly  upon  the  modeling  of  the  message 
transmittals. The communications cost can be divided into query cost and update cost.
The query cost can be further divided into a query (interrogation) cost and response cost,
along the same lines as mentioned above.

The  cost  of  reconfiguration  or  transition  when  a file  is  moved  to  a new  site  is 
necessary  for  the  deterministic  multi-phase  and  stochastic  models.  This  transition  cost 
includes only the cost of sending the file from one node of the data base to another node.
No overhead cost for reconfiguration has been provided in the transition cost.

The  relationships  discussed  above  describe  the  factors  which  have  been  considered 
in  modeling  distributed  data  bases.  Optimal  allocation  of  files  in  a network  is  a fairly  new 
field  of  research  and  the  relationships  listed  should  not  be  considered  to  exhaust  the  factors 
which might be needed to model realistic distributed data bases.
TABLE I.  FILE INFORMATION AND PARAMETERS


1.  Number of copies of a particular file

a.  Given at design time

b.  Variable - one of the parameters to be used in the optimization procedure

2.  Length of file

a.  Short - no interaction between files

b.  Known length

3.  Request rates for information contained within the file

a.  Rate at which a particular program requests a file

b.  Rate at which a node in the network requests a file

4.  Update rates for modifying a file

5.  Query rates for obtaining information

6.  File dependence

a.  Independent of each other - no interaction between the files

b.  Dependent upon each other

TABLE II.  TRANSMISSION CHARACTERISTICS

1.  Time to retrieve file from one node of the network to another node

2.  Maximum retrieval time

3.  Transmission channel capacity

4.  Message queuing

a.  Average delay in sending request or query

b.  Average delay in receiving a reply from a query after it has been sent

c.  Poisson arrivals of messages - request and reply

5.  Random lengths of messages

6.  Priorities

a.  Short requests - high priority

b.  Long replies - low priority

7.  Rate of message traffic from file to user
TABLE III.  COMPUTER CHARACTERISTICS


1.  Memory

a.  Finite amount of memory

b.  No restriction on memory

2.  File update and retrieval time

3.  Programs only run on specific machines

TABLE IV.  COSTS

1.  Storage cost

a.  File storage cost

b.  Program storage cost

2.  Communications cost

3.  Query cost

4.  Update cost

5.  Reconfiguration or transition cost when a file moves to a new site

6.  Communication cost of queries

a.  Program (user) to file

b.  File to user communication costs

7.  Communication cost of updates (same as 6a and b)
VI.  SUMMARY

In this paper we have examined the existing distributed data base file allocation
models. A breakdown of the models by type (deterministic one-phase, deterministic
multi-phase, stochastic discrete, stochastic continuous) was given. The relationships and identities
used to describe the models were divided into four categories: file information and
parameters, transmission characteristics, computer characteristics, and costs. In the investigations
which led to this paper it was seen that the models defined were initially very general. The
models included relationships which were very detailed in their description of the file
allocation problem. In previous analyses using these models, simplifications were often made for
computational tractability. Many of the assumptions and models ended up so restricted in
scope or detail as to be unrealistic. There is a great need for more work in this area.




REFERENCES 


1.  WW Chu, "Optimal File Allocation in a Computer Network," Computer Comm
Networks, Prentice Hall 1973, p 82-94

2.  V Whitney, A Study of Optimal File Site Assignment, PhD Thesis, Univ of Mich
(1970)

3.  RG Casey, "Allocation of Copies of a File in an Information Network," Spring Joint
Computer Conference 1972, p 617-625

4.  HL Morgan and KD Levin, "Optimal Program and Data Locations in Computer
Networks," CACM Vol 20 (1977), p 315-21

5.  KD Levin and HL Morgan, "Optimizing Distributed Data Bases - a Framework for
Research," National Computer Conference 1975, p 473-478

6.  A Segall, "Dynamic File Assignment in a Computer Network," IEEE TAC, Vol 21
(1976), p 161-173

7.  KD Levin, Organizing Distributed Data Bases in Computer Networks, Penn Univ
Decision Sci Dept, Rept 74-09-01, 1974




